{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://www.kaggle.com/code/peremartramanonellas/rouge-evaluation-untrained-vs-trained-llm?scriptVersionId=168566619\" target=\"_blank\"><img align=\"left\" alt=\"Kaggle\" title=\"Open in Kaggle\" src=\"https://kaggle.com/static/images/open-in-kaggle.svg\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ROUGE Metrics\n",
    "The metrics that we have been using up to now with more traditional models, such as Accuracy, F1 Score or Recall, do not help us evaluate the results of generative models.\n",
    "\n",
    "With these models, we are beginning to use metrics such as BLEU, ROUGE, or METEOR, which are adapted to the objective of the model.\n",
    "\n",
    "In this notebook, I want to explain how to use the ROUGE metrics to measure the quality of the summaries produced by the newest large language models.\n",
    "\n",
    "If you're looking for more detailed explanations, you can refer to the article: [Rouge Metrics: Evaluating Summaries in Large Language Models](https://medium.com/towards-artificial-intelligence/rouge-metrics-evaluating-summaries-in-large-language-models-d200ee7ca0e6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.status.busy": "2024-03-24T14:48:26.984979Z",
     "iopub.status.idle": "2024-03-24T14:48:26.985422Z",
     "shell.execute_reply": "2024-03-24T14:48:26.985202Z",
     "shell.execute_reply.started": "2024-03-24T14:48:26.985183Z"
    }
   },
   "outputs": [],
   "source": [
    "#Import generic libraries\n",
    "import numpy as np \n",
    "import pandas as pd\n",
    "import torch\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset is available on Kaggle and comprises a collection of technological news articles compiled by MIT. The article text is located in the 'Article Body' column.\n",
    "\n",
    "https://www.kaggle.com/datasets/deepanshudalal09/mit-ai-news-published-till-2023"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.status.busy": "2024-03-24T14:48:26.986614Z",
     "iopub.status.idle": "2024-03-24T14:48:26.987032Z",
     "shell.execute_reply": "2024-03-24T14:48:26.986854Z",
     "shell.execute_reply.started": "2024-03-24T14:48:26.986834Z"
    }
   },
   "outputs": [],
   "source": [
    "news = pd.read_csv('/kaggle/input/mit-ai-news-published-till-2023/articles.csv')\n",
    "DOCUMENT=\"Article Body\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.status.busy": "2024-03-24T14:48:26.988315Z",
     "iopub.status.idle": "2024-03-24T14:48:26.988771Z",
     "shell.execute_reply": "2024-03-24T14:48:26.988585Z",
     "shell.execute_reply.started": "2024-03-24T14:48:26.988558Z"
    }
   },
   "outputs": [],
   "source": [
    "# Because this is just a course example, we select a small subset of the news.\n",
    "MAX_NEWS = 3\n",
    "subset_news = news.head(MAX_NEWS)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.status.busy": "2024-03-24T14:48:26.989995Z",
     "iopub.status.idle": "2024-03-24T14:48:26.990467Z",
     "shell.execute_reply": "2024-03-24T14:48:26.99025Z",
     "shell.execute_reply.started": "2024-03-24T14:48:26.990229Z"
    }
   },
   "outputs": [],
   "source": [
    "subset_news.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.status.busy": "2024-03-24T14:48:26.993681Z",
     "iopub.status.idle": "2024-03-24T14:48:26.994789Z",
     "shell.execute_reply": "2024-03-24T14:48:26.994584Z",
     "shell.execute_reply.started": "2024-03-24T14:48:26.994562Z"
    }
   },
   "outputs": [],
   "source": [
    "articles = subset_news[DOCUMENT].tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load the Models and create the summaries\n",
    "\n",
    "We will compare two models: t5-base and flax-community/t5-base-cnn-dm, a fine-tuned version of it. Both are available on Hugging Face, so we will work with the Transformers library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.status.busy": "2024-03-24T14:48:26.995958Z",
     "iopub.status.idle": "2024-03-24T14:48:26.996385Z",
     "shell.execute_reply": "2024-03-24T14:48:26.99618Z",
     "shell.execute_reply.started": "2024-03-24T14:48:26.996162Z"
    }
   },
   "outputs": [],
   "source": [
    "import transformers\n",
    "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
    "\n",
    "model_name_small = \"t5-base\"\n",
    "model_name_reference = \"flax-community/t5-base-cnn-dm\"\n",
    "#model_name_reference = \"pszemraj/long-t5-tglobal-base-16384-booksum-V11-big_patent-V2\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.status.busy": "2024-03-24T14:48:26.997626Z",
     "iopub.status.idle": "2024-03-24T14:48:26.99803Z",
     "shell.execute_reply": "2024-03-24T14:48:26.997857Z",
     "shell.execute_reply.started": "2024-03-24T14:48:26.997838Z"
    }
   },
   "outputs": [],
   "source": [
    "#This function returns the tokenizer and the Model. \n",
    "def get_model(model_id):\n",
    "    tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
    "    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)\n",
    "    \n",
    "    return tokenizer, model\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.status.busy": "2024-03-24T14:48:26.999553Z",
     "iopub.status.idle": "2024-03-24T14:48:26.999951Z",
     "shell.execute_reply": "2024-03-24T14:48:26.999775Z",
     "shell.execute_reply.started": "2024-03-24T14:48:26.999755Z"
    }
   },
   "outputs": [],
   "source": [
    "tokenizer_small, model_small = get_model(model_name_small)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.status.busy": "2024-03-24T14:48:27.001219Z",
     "iopub.status.idle": "2024-03-24T14:48:27.001671Z",
     "shell.execute_reply": "2024-03-24T14:48:27.001489Z",
     "shell.execute_reply.started": "2024-03-24T14:48:27.001469Z"
    }
   },
   "outputs": [],
   "source": [
    "tokenizer_reference, model_reference = get_model(model_name_reference)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With both models downloaded and ready, we create a function that will generate the summaries.\n",
    "\n",
    "The function takes four parameters:\n",
    "\n",
    "* the list of texts to summarize.\n",
    "* the tokenizer.\n",
    "* the model.\n",
    "* the maximum length for the generated summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.status.busy": "2024-03-24T14:48:27.004057Z",
     "iopub.status.idle": "2024-03-24T14:48:27.005522Z",
     "shell.execute_reply": "2024-03-24T14:48:27.005306Z",
     "shell.execute_reply.started": "2024-03-24T14:48:27.005285Z"
    }
   },
   "outputs": [],
   "source": [
    "def create_summaries(texts_list, tokenizer, model, max_l=125):\n",
    "    \n",
    "    # We are going to add a prefix to each article to be summarized \n",
    "    # so that the model knows what it should do\n",
    "    prefix = \"Summarize this news: \"  \n",
    "    summaries_list = [] #Will contain all summaries\n",
    "\n",
    "    texts_list = [prefix + text for text in texts_list]\n",
    "    \n",
    "    for text in texts_list:\n",
    "        \n",
    "        summary=\"\"\n",
    "        \n",
    "        #calculate the encodings\n",
    "        input_encodings = tokenizer(text, \n",
    "                                    max_length=1024, \n",
    "                                    return_tensors='pt', \n",
    "                                    padding=True, \n",
    "                                    truncation=True)\n",
    "\n",
    "        # Generate summaries\n",
    "        with torch.no_grad():\n",
    "            output = model.generate(\n",
    "                input_ids=input_encodings.input_ids,\n",
    "                attention_mask=input_encodings.attention_mask,\n",
    "                max_length=max_l,  # Set the maximum length of the generated summary\n",
    "                num_beams=2,     # Set the number of beams for beam search\n",
    "                early_stopping=True\n",
    "            )\n",
    "            \n",
    "        #Decode to get the text\n",
    "        summary = tokenizer.batch_decode(output, skip_special_tokens=True)\n",
    "        \n",
    "        #Add the summary to summaries list \n",
    "        summaries_list += summary\n",
    "    return summaries_list \n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To create the summaries, we call the 'create_summaries' function, passing both the news articles and the corresponding tokenizer and model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.status.busy": "2024-03-24T14:48:27.006652Z",
     "iopub.status.idle": "2024-03-24T14:48:27.007493Z",
     "shell.execute_reply": "2024-03-24T14:48:27.007225Z",
     "shell.execute_reply.started": "2024-03-24T14:48:27.007195Z"
    }
   },
   "outputs": [],
   "source": [
    "# Creating the summaries for both models. \n",
    "summaries_small = create_summaries(articles, \n",
    "                                  tokenizer_small, \n",
    "                                  model_small)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.status.busy": "2024-03-24T14:48:27.00885Z",
     "iopub.status.idle": "2024-03-24T14:48:27.009239Z",
     "shell.execute_reply": "2024-03-24T14:48:27.009065Z",
     "shell.execute_reply.started": "2024-03-24T14:48:27.009046Z"
    }
   },
   "outputs": [],
   "source": [
    "summaries_reference = create_summaries(articles, \n",
    "                                      tokenizer_reference, \n",
    "                                      model_reference)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.status.busy": "2024-03-24T14:48:27.010949Z",
     "iopub.status.idle": "2024-03-24T14:48:27.01145Z",
     "shell.execute_reply": "2024-03-24T14:48:27.011234Z",
     "shell.execute_reply.started": "2024-03-24T14:48:27.011213Z"
    }
   },
   "outputs": [],
   "source": [
    "summaries_small"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.status.busy": "2024-03-24T14:48:27.013596Z",
     "iopub.status.idle": "2024-03-24T14:48:27.014012Z",
     "shell.execute_reply": "2024-03-24T14:48:27.013828Z",
     "shell.execute_reply.started": "2024-03-24T14:48:27.013809Z"
    }
   },
   "outputs": [],
   "source": [
    "summaries_reference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At first glance, it's evident that the summaries are different. \n",
    "\n",
    "However, it's challenging to determine which one is better. \n",
    "\n",
    "It's even difficult to discern whether they are significantly distinct or if there are just subtle differences between them.\n",
    "\n",
    "This is what we are going to verify now using ROUGE. When comparing the summaries of one model with those of the other, we don't get an idea of which one is better, but rather an idea of how much the summaries have changed with the fine-tuning applied to the model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ROUGE\n",
    "Let's install and load all the necessary libraries to conduct a ROUGE evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:22:26.695431Z",
     "iopub.status.busy": "2024-02-11T16:22:26.694934Z",
     "iopub.status.idle": "2024-02-11T16:22:58.234355Z",
     "shell.execute_reply": "2024-02-11T16:22:58.233034Z",
     "shell.execute_reply.started": "2024-02-11T16:22:26.695391Z"
    }
   },
   "outputs": [],
   "source": [
    "!pip install evaluate\n",
    "!pip install rouge_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:22:58.23667Z",
     "iopub.status.busy": "2024-02-11T16:22:58.236275Z",
     "iopub.status.idle": "2024-02-11T16:23:01.759691Z",
     "shell.execute_reply": "2024-02-11T16:23:01.758201Z",
     "shell.execute_reply.started": "2024-02-11T16:22:58.236633Z"
    }
   },
   "outputs": [],
   "source": [
    "import evaluate\n",
    "from nltk.tokenize import sent_tokenize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:23:01.771547Z",
     "iopub.status.busy": "2024-02-11T16:23:01.771216Z",
     "iopub.status.idle": "2024-02-11T16:23:03.461674Z",
     "shell.execute_reply": "2024-02-11T16:23:03.458933Z",
     "shell.execute_reply.started": "2024-02-11T16:23:01.771518Z"
    }
   },
   "outputs": [],
   "source": [
    "# Use the load function of the evaluate library\n",
    "# to create the rouge_score object\n",
    "rouge_score = evaluate.load(\"rouge\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculating ROUGE is as simple as calling the *compute* function of the *rouge_score* object we created earlier. This function takes the texts to compare as arguments and a third value, *use_stemmer*, which indicates whether words should be reduced to their stems or compared as full words.\n",
    "\n",
    "A *stemmer* reduces the different forms of a word to a common base, called the *stem*. \n",
    "\n",
    "Some examples of stemming are: \n",
    "* Jumping -> Jump. \n",
    "* Running -> Run. \n",
    "* Cats -> Cat. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:23:03.465707Z",
     "iopub.status.busy": "2024-02-11T16:23:03.464659Z",
     "iopub.status.idle": "2024-02-11T16:23:03.47552Z",
     "shell.execute_reply": "2024-02-11T16:23:03.473634Z",
     "shell.execute_reply.started": "2024-02-11T16:23:03.465649Z"
    }
   },
   "outputs": [],
   "source": [
    "def compute_rouge_score(generated, reference):\n",
    "    \n",
    "    # ROUGE-Lsum expects the sentences of each text to be separated by '\\n'\n",
    "    generated_with_newlines = [\"\\n\".join(sent_tokenize(s.strip())) for s in generated]\n",
    "    reference_with_newlines = [\"\\n\".join(sent_tokenize(s.strip())) for s in reference]\n",
    "    \n",
    "    return rouge_score.compute(\n",
    "        predictions=generated_with_newlines,\n",
    "        references=reference_with_newlines,\n",
    "        use_stemmer=True,\n",
    "        \n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:23:03.479396Z",
     "iopub.status.busy": "2024-02-11T16:23:03.478135Z",
     "iopub.status.idle": "2024-02-11T16:23:03.783088Z",
     "shell.execute_reply": "2024-02-11T16:23:03.781362Z",
     "shell.execute_reply.started": "2024-02-11T16:23:03.479343Z"
    }
   },
   "outputs": [],
   "source": [
    "compute_rouge_score(summaries_small, summaries_reference)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that there is a difference between the two models when performing summarization. \n",
    "\n",
    "For example, in ROUGE-1, the similarity is 47%, while in ROUGE-2, it's 32%. This indicates that the summaries share some content but are clearly different. \n",
    "\n",
    "However, we still don't know which model is better since we have compared them to each other and not to a reference text. But at the very least, we know that the fine-tuning process applied to the second model has significantly altered its results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-07T17:21:28.296636Z",
     "iopub.status.busy": "2023-08-07T17:21:28.296173Z",
     "iopub.status.idle": "2023-08-07T17:21:28.301906Z",
     "shell.execute_reply": "2023-08-07T17:21:28.300702Z",
     "shell.execute_reply.started": "2023-08-07T17:21:28.296601Z"
    }
   },
   "source": [
    "# Comparing to a Dataset with real summaries\n",
    "We are going to load the cnn_dailymail dataset. This is a well-known dataset available in the **Datasets** library, and it suits our purpose perfectly. \n",
    "\n",
    "Apart from the news, it also contains pre-existing summaries. \n",
    "\n",
    "We will compare the summaries generated by the two models we are using with those from the dataset to determine which model creates summaries that are closer to the reference ones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:23:03.78564Z",
     "iopub.status.busy": "2024-02-11T16:23:03.785171Z",
     "iopub.status.idle": "2024-02-11T16:23:17.746392Z",
     "shell.execute_reply": "2024-02-11T16:23:17.744439Z",
     "shell.execute_reply.started": "2024-02-11T16:23:03.785598Z"
    }
   },
   "outputs": [],
   "source": [
    "!pip install -q datasets==2.1.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:23:17.749368Z",
     "iopub.status.busy": "2024-02-11T16:23:17.748834Z",
     "iopub.status.idle": "2024-02-11T16:23:19.37989Z",
     "shell.execute_reply": "2024-02-11T16:23:19.378417Z",
     "shell.execute_reply.started": "2024-02-11T16:23:17.749323Z"
    }
   },
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "cnn_dataset = load_dataset(\n",
    "    \"ccdv/cnn_dailymail\", version=\"3.0.0\"\n",
    ")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:23:19.38356Z",
     "iopub.status.busy": "2024-02-11T16:23:19.382377Z",
     "iopub.status.idle": "2024-02-11T16:23:19.39789Z",
     "shell.execute_reply": "2024-02-11T16:23:19.396514Z",
     "shell.execute_reply.started": "2024-02-11T16:23:19.383515Z"
    }
   },
   "outputs": [],
   "source": [
    "# Take just a few news articles to test\n",
    "sample_cnn = cnn_dataset[\"test\"].select(range(MAX_NEWS))\n",
    "\n",
    "sample_cnn"
   ]
  },
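  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We retrieve the length of the longest reference summary so that the models are allowed to generate summaries of a similar length. Note that this length is measured in characters, while *max_length* in *generate* counts tokens, so it acts as a generous upper bound."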
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:23:19.400354Z",
     "iopub.status.busy": "2024-02-11T16:23:19.399916Z",
     "iopub.status.idle": "2024-02-11T16:23:19.412029Z",
     "shell.execute_reply": "2024-02-11T16:23:19.410506Z",
     "shell.execute_reply.started": "2024-02-11T16:23:19.400321Z"
    }
   },
   "outputs": [],
   "source": [
    "max_length = max(len(item['highlights']) for item in sample_cnn)\n",
    "max_length = max_length + 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:23:19.414594Z",
     "iopub.status.busy": "2024-02-11T16:23:19.41414Z",
     "iopub.status.idle": "2024-02-11T16:23:47.08428Z",
     "shell.execute_reply": "2024-02-11T16:23:47.082408Z",
     "shell.execute_reply.started": "2024-02-11T16:23:19.414556Z"
    }
   },
   "outputs": [],
   "source": [
    "summaries_t5_base = create_summaries(sample_cnn[\"article\"], \n",
    "                                      tokenizer_small, \n",
    "                                      model_small, \n",
    "                                      max_l=max_length)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:23:47.086742Z",
     "iopub.status.busy": "2024-02-11T16:23:47.086282Z",
     "iopub.status.idle": "2024-02-11T16:24:11.414682Z",
     "shell.execute_reply": "2024-02-11T16:24:11.41374Z",
     "shell.execute_reply.started": "2024-02-11T16:23:47.086698Z"
    }
   },
   "outputs": [],
   "source": [
    "summaries_t5_finetuned = create_summaries(sample_cnn[\"article\"], \n",
    "                                      tokenizer_reference, \n",
    "                                      model_reference, \n",
    "                                      max_l=max_length)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:24:11.417026Z",
     "iopub.status.busy": "2024-02-11T16:24:11.416452Z",
     "iopub.status.idle": "2024-02-11T16:24:11.422096Z",
     "shell.execute_reply": "2024-02-11T16:24:11.421154Z",
     "shell.execute_reply.started": "2024-02-11T16:24:11.416994Z"
    }
   },
   "outputs": [],
   "source": [
    "#Get the real summaries from the cnn_dataset\n",
    "real_summaries = sample_cnn['highlights']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at the generated summaries alongside the reference summaries provided by the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:24:11.423732Z",
     "iopub.status.busy": "2024-02-11T16:24:11.42325Z",
     "iopub.status.idle": "2024-02-11T16:24:11.44526Z",
     "shell.execute_reply": "2024-02-11T16:24:11.443972Z",
     "shell.execute_reply.started": "2024-02-11T16:24:11.4237Z"
    }
   },
   "outputs": [],
   "source": [
    "summaries = pd.DataFrame.from_dict(\n",
    "        {\n",
    "            \"base\": summaries_t5_base, \n",
    "            \"finetuned\": summaries_t5_finetuned,\n",
    "            \"reference\": real_summaries,\n",
    "        }\n",
    "    )\n",
    "summaries.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can calculate the ROUGE scores for the two models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:24:11.447509Z",
     "iopub.status.busy": "2024-02-11T16:24:11.44714Z",
     "iopub.status.idle": "2024-02-11T16:24:11.708388Z",
     "shell.execute_reply": "2024-02-11T16:24:11.707032Z",
     "shell.execute_reply.started": "2024-02-11T16:24:11.447478Z"
    }
   },
   "outputs": [],
   "source": [
    "compute_rouge_score(summaries_t5_base, real_summaries)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-11T16:24:11.710987Z",
     "iopub.status.busy": "2024-02-11T16:24:11.71049Z",
     "iopub.status.idle": "2024-02-11T16:24:11.979451Z",
     "shell.execute_reply": "2024-02-11T16:24:11.978018Z",
     "shell.execute_reply.started": "2024-02-11T16:24:11.710939Z"
    }
   },
   "outputs": [],
   "source": [
    "compute_rouge_score(summaries_t5_finetuned, real_summaries)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With these results, I would say that the fine-tuned model performs slightly better than the T5-Base model. It achieves higher ROUGE scores in all metrics except LSUM, where the difference is minimal.\n",
    "\n",
    "Additionally, the ROUGE metrics are quite interpretable. \n",
    "\n",
    "LSUM is based on the longest common subsequence: the words that appear in the same order in both texts, though not necessarily next to each other. It is computed sentence by sentence and then aggregated over the whole summary. \n",
    "\n",
    "This can be a good indicator of overall similarity between texts. However, both models have very similar LSUM scores, and the fine-tuned model has better scores in other ROUGE metrics.\n",
    "\n",
    "Personally, I would lean towards the fine-tuned model, although the difference may not be very significant.\n",
    "\n",
    "## Continue learning\n",
    "This notebook is part of a [course on large language models](https://github.com/peremartra/Large-Language-Model-Notebooks-Course) I'm working on, which is available on [GitHub](https://github.com/peremartra/Large-Language-Model-Notebooks-Course). You can see the other lessons there, and if you like it, don't forget to subscribe to receive notifications of new lessons.\n",
    "\n",
    "Other notebooks in the Large Language Models series: \n",
    "* https://www.kaggle.com/code/peremartramanonellas/use-a-vectorial-db-to-optimize-prompts-for-llms\n",
    "* https://www.kaggle.com/code/peremartramanonellas/ask-your-documents-with-langchain-vectordb-hf\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feel free to fork or edit the notebook for your own convenience. Please consider ***UPVOTING IT***. It helps others to discover the notebook, and it encourages me to continue publishing."
   ]
  }
 ],
 "metadata": {
  "kaggle": {
   "accelerator": "none",
   "dataSources": [
    {
     "datasetId": 3496946,
     "sourceId": 6104553,
     "sourceType": "datasetVersion"
    }
   ],
   "dockerImageVersionId": 30527,
   "isGpuEnabled": false,
   "isInternetEnabled": true,
   "language": "python",
   "sourceType": "notebook"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
